
When do spectral gradient updates help in deep learning?

Davis, Damek, Drusvyatskiy, Dmitriy

arXiv.org Machine Learning

Spectral gradient methods, such as the recently popularized Muon optimizer, are a promising alternative to standard Euclidean gradient descent for training deep neural networks and transformers, but it is still unclear in which regimes they are expected to perform better. We propose a simple layerwise condition that predicts when a spectral update yields a larger decrease in the loss than a Euclidean gradient step. This condition compares, for each parameter block, the squared nuclear-to-Frobenius ratio of the gradient to the stable rank of the incoming activations. To understand when this condition may be satisfied, we first prove that post-activation matrices have low stable rank at Gaussian initialization in random feature regression, feedforward networks, and transformer blocks. In spiked random feature models we then show that, after a short burn-in, the Euclidean gradient's nuclear-to-Frobenius ratio grows with the data dimension while the stable rank of the activations remains bounded, so the predicted advantage of spectral updates scales with dimension. We validate these predictions in synthetic regression experiments and in NanoGPT-scale language model training, where we find that intermediate activations have low stable rank throughout training and the corresponding gradients maintain large nuclear-to-Frobenius ratios. Together, these results identify conditions for spectral gradient methods, such as Muon, to be effective in training deep networks and transformers.
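The two quantities the abstract compares are cheap to compute from singular values. A minimal sketch (function names and the exact form of the test are our paraphrase of the abstract, not the paper's code):

```python
import numpy as np

def nuclear_frobenius_ratio(G):
    """Ratio of the nuclear norm (sum of singular values) to the
    Frobenius norm (root of the sum of squared singular values)."""
    s = np.linalg.svd(G, compute_uv=False)
    return s.sum() / np.sqrt((s ** 2).sum())

def stable_rank(A):
    """Stable rank: squared Frobenius norm over squared spectral norm."""
    s = np.linalg.svd(A, compute_uv=False)
    return (s ** 2).sum() / s[0] ** 2

def spectral_update_predicted_to_help(grad, activations):
    """Layerwise condition as stated in the abstract: a spectral step is
    predicted to decrease the loss more than a Euclidean step when the
    squared nuclear-to-Frobenius ratio of the gradient exceeds the
    stable rank of the incoming activations."""
    return nuclear_frobenius_ratio(grad) ** 2 > stable_rank(activations)
```

For a gradient with a flat spectrum (e.g. the identity, whose squared ratio equals its dimension) paired with rank-one activations (stable rank 1), the condition fires; for a rank-one gradient it never does, since the squared ratio is then exactly 1.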


SUPN: Shallow Universal Polynomial Networks

Morrow, Zachary, Penwarden, Michael, Chen, Brian, Javeed, Aurya, Narayan, Akil, Jakeman, John D.

arXiv.org Artificial Intelligence

Deep neural networks (DNNs) and Kolmogorov-Arnold networks (KANs) are popular methods for function approximation due to their flexibility and expressivity. However, they typically require a large number of trainable parameters to produce a suitable approximation. Beyond making the resulting network less transparent, overparameterization creates a large optimization space, likely producing local minima in training that have quite different generalization errors. In this case, network initialization can have an outsized impact on the model's out-of-sample accuracy. For these reasons, we propose shallow universal polynomial networks (SUPNs). These networks replace all but the last hidden layer with a single layer of polynomials with learnable coefficients, leveraging the strengths of DNNs and polynomials to achieve sufficient expressivity with far fewer parameters. We prove that SUPNs converge at the same rate as the best polynomial approximation of the same degree, and we derive explicit formulas for quasi-optimal SUPN parameters. We complement the theory with an extensive suite of numerical experiments involving SUPNs, DNNs, KANs, and polynomial projection in one, two, and ten dimensions, consisting of over 13,000 trained models. On the target functions we studied numerically, for a given number of trainable parameters, the approximation error and its variability are often an order of magnitude lower for SUPNs than for DNNs and KANs. In our examples, SUPNs even outperform polynomial projection on non-smooth functions.
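The architecture described above (polynomial layer, then one standard hidden layer) can be sketched as follows. This is our reading of the abstract, not the authors' code; the shapes, the scalar input, and the tanh activation are illustrative assumptions:

```python
import numpy as np

def supn_forward(x, C, W, b, w_out, b_out):
    """Hypothetical SUPN-style forward pass for a scalar input x.

    C     : (K, d+1) learnable polynomial coefficients, degree d,
            one polynomial per hidden-feature unit (this replaces
            all but the last hidden layer).
    W, b  : weights/bias of the single remaining hidden layer.
    w_out, b_out : linear output layer.
    """
    d = C.shape[1] - 1
    powers = x ** np.arange(d + 1)   # monomial basis [1, x, ..., x^d]
    p = C @ powers                   # K learnable polynomial features
    h = np.tanh(W @ p + b)           # the one retained hidden layer
    return float(w_out @ h + b_out)  # scalar output
```

With `C = [[0, 1, 0]]` (so the single polynomial is just p(x) = x), identity `W`, and zero biases, the network reduces to `tanh(x)`, which makes the parameter roles easy to check by hand.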


Priors in Time: Missing Inductive Biases for Language Model Interpretability

Lubana, Ekdeep Singh, Rager, Can, Hindupur, Sai Sumedh R., Costa, Valerie, Tuckute, Greta, Patel, Oam, Murthy, Sonia Krishna, Fel, Thomas, Wurgaft, Daniel, Bigelow, Eric J., Lin, Johnny, Ba, Demba, Wattenberg, Martin, Viegas, Fernanda, Weber, Melanie, Mueller, Aaron

arXiv.org Artificial Intelligence

Recovering meaningful concepts from language model activations is a central aim of interpretability. While existing feature extraction methods aim to identify concepts that are independent directions, it is unclear if this assumption can capture the rich temporal structure of language. Specifically, via a Bayesian lens, we demonstrate that Sparse Autoencoders (SAEs) impose priors that assume independence of concepts across time, implying stationarity. Meanwhile, language model representations exhibit rich temporal dynamics, including systematic growth in conceptual dimensionality, context-dependent correlations, and pronounced non-stationarity, in direct conflict with the priors of SAEs. Taking inspiration from computational neuroscience, we introduce a new interpretability objective -- Temporal Feature Analysis -- which possesses a temporal inductive bias to decompose representations at a given time into two parts: a predictable component, which can be inferred from the context, and a residual component, which captures novel information unexplained by the context. Temporal Feature Analyzers correctly parse garden path sentences, identify event boundaries, and more broadly delineate abstract, slow-moving information from novel, fast-moving information, while existing SAEs exhibit significant failures on all of these tasks. Overall, our results underscore the need for inductive biases that match the data in designing robust interpretability tools.
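The predictable/residual split has a simple linear analogue that conveys the idea (a toy illustration only, not the paper's actual objective): predict each activation from its predecessor by least squares and treat the prediction error as the residual component.

```python
import numpy as np

def temporal_decompose(X):
    """Toy predictable/residual decomposition of a sequence of activations.

    X : (T, d) array of activations over T time steps.
    Fits a linear map W minimizing ||X[1:] - X[:-1] @ W||_F, then splits
    each x_t (t >= 1) into a context-predictable part and a residual.
    """
    X_prev, X_next = X[:-1], X[1:]
    W, *_ = np.linalg.lstsq(X_prev, X_next, rcond=None)
    predictable = X_prev @ W          # part inferable from the context
    residual = X_next - predictable   # novel information at each step
    return predictable, residual
```

On a sequence generated by deterministic linear dynamics the residual vanishes (everything is predictable from context), whereas injected noise shows up entirely in the residual component.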


cae82d4350cc23aca7fc9ae38dab38ab-AuthorFeedback.pdf

Neural Information Processing Systems

We thank the reviewers for their insightful comments and detailed analysis of our work. Low-rank representation of nonsymmetric DPP kernel: the first term on the right side of Eq. 12 will be singular […]. Regarding the time complexity of the low-rank representation, we see from Eq. 12 that the time complexity required to […]. We will add some text to the camera-ready version of our paper to make this point clear. [Cited works: "Learning signed determinantal point processes through the principal minor assignment problem"; "Learning determinantal point processes by corrective negative sampling."]


the clarity of the paper and the experiments that we curated to study in detail the strengths of our method, disentangling

Neural Information Processing Systems

We wish to thank the reviewers for their time and thorough reviews. Our goal is to tackle high-dimensional problems (e.g., Ant has an observation space of 111 and an action space of 8); we'll […]. All of them suffered from the leaking patch problem. How this keeps planning from exploiting the inaccuracies of the DAE is easiest to see in gradient-based optimization. […] (VAE) has a spurious maximum.


backpropagate through an equilibrium state of the network (which, to the best of our knowledge, no deep approaches

Neural Information Processing Systems

We thank the reviewers for their valuable feedback. The way DEQ "ignores" depth and solves for the equilibrium suggests a different view of output modeling and further […]. We also agree with the reviewers that the runtime discussion should be moved into the main text. We thank reviewer #1 for the valuable feedback. The DEQ approach is very different from techniques like gradient checkpointing (GC), which is an implementation-based methodology that is practical on almost any layer-based network. Quantitatively, we have followed the reviewer's suggestion and compared GC and DEQ using a 70-layer TrellisNet (w/ […]). We find that GC works best when we checkpoint after every 9 layers, and record 5.2 GB […]. The training speed of GC is approximately 1.6 […]. We thank reviewer #3 for the comments, and for taking the time to check our proof and read our code.





We thank the reviewers for their kind comments and for their consensus view that our theoretical results on TV modulus

Neural Information Processing Systems

We are also thankful for the reviewers' concrete suggestions on improving the draft. We agree with the reviewers that our proposed estimators are not computationally efficient. We work in the high-temperature regime, i.e., […]. We agree with the reviewer that our estimator doesn't recover the true model even in […]. Width as a relaxed parameter: […]. We thank the reviewer for raising this subtle issue.